import scanpy as sc
import scrublet as scr
import numpy as np
import os
import pandas as pd
import polars as pl
import matplotlib.pyplot as plt
import seaborn as snsIntroduction to scRNAseq analysis using scanpy
Workshop
Introduction
Single-cell RNA sequencing (scRNA-seq)
- scRNA-seq is a powerful technology for sequencing individual cells’ transcriptomes.
- Provides high-resolution insights into cell-to-cell variability.
- Enables study of cellular diversity within tissues.
- Aids in identification of rare cell types.
scRNAseq vs bulk RNAseq
- RNA-seq provides a combined overview of gene expression from all cells, masking individual differences (similar to the combined flavor of a smoothie)
- scRNA-seq provides a detailed profile of each cell’s gene expression (like tasting each fruit separately),
Advantajes and limitations
| Feature | Bulk data | Single-cell data |
|---|---|---|
| Cell resolution | Average of all cells | Individual cell resolution |
| Sample representation | Vector of gene expression values | Matrix of gene expression values |
| Genomic resolution | Higher, depends on sequencing depth | Lower, depends on starting material |
| Cost | Lower | High |
| Computational requirements | Lower | Higher |
| Data size | Lower | Higher |
| Data interpretation | Simple | Complex |
Current trends in scRNAseq studies
- Increased scale
- Increased diversiry
- Multiomics integration
- Spatial context
Sample representation
- In bulk data, each sample is repressented by a vector, where each value is a gene.
- In single cell data, each sample is a matrix, where each row is a gene and each column is a cell.
\[\begin{align} Bulk &= \begin{bmatrix} gene_{1} \\ gene_{2} \\ gene_{3}\\ \vdots \\ gene_{n} \end{bmatrix} \\ \\ Single-cell &= \begin{bmatrix} gene_1, cell_1 & gene_1, cell_2 & gene_1, cell_3 & \dots & gene_1, cell_m \\ gene_2, cell_1 & gene_2, cell_2 & gene_2, cell_3 & \dots & gene_2, cell_m \\ gene_3, cell_1 & gene_3, cell_2 & gene_3, cell_3 & \dots & gene_3, cell_m \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ gene_n, cell_1 & gene_n, cell_2 & gene_n, cell_3 & \dots & gene_n, cell_m \end{bmatrix} \end{align}\]
scRNAseq packages: Scanpy vs Seurat
scanpyis a Python package for single-cell analysis (Wolf, Angerer, and Theis 2018)Seuratis an R package for single-cell analysis (Satija et al. 2015)- Both are:
- User-friendly tools for single-cell analysis
- Open source
- Well-documented
- Widely-used
- Choice depends on:
- Language preference
- Team expertise
- Integration with downstream analysis
- Speed and memory requirements
Hint: A good bioinformatician is not restricted by language. You can use R in Python can be done using the
rpy2package. And Python can be use within R usingreticulate.
Scanpy: the AnnData object
Scanpy’s AnnData is a flexible container designed for single-cell gene expression data and annotations. Key features include:
- Data Matrix: Holds gene expression values with cells as rows and genes as columns. Annotations: Allows adding detailed annotations for both cells and genes.
- Multidimensional Annotations: Supports storing multiple matrices for processed and raw data.
- Unstructured Data: Provides space for additional information like quality metrics or analysis results.
Set up
Go to Google Colab
Open a notebook.
Open an notebook from Github.
Introduce https://github.com/IreneRobles/scrnaseq_workshops and select workshop_notebook.ipynb
If you run notebook set up commands, it should install required packages and download test data.
Download repo
git clone https://github.com/IreneRobles/scrnaseq_workshops.git
cd /scrnaseq_workshopsCreate environment
conda create -n myscanpy python=3.10
conda activate myscanpy
pip install -r requirements.txtRegister jupyter kernel
python -m ipykernel install --user --name myscanpy --display-name "Python (myscanpy)"Download test data
mkdir data
cd data
wget https://ftp.ncbi.nlm.nih.gov/geo/series/GSE190nnn/GSE190622/suppl/GSE190622%5Fcount%5Fmatrix%5FAnnotated.csv.gzImport libraries
Scanpy setttings
sc.settings.verbosity = 3 # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_header()
sc.settings.set_figure_params(dpi=80, facecolor='white')scanpy==1.9.8 anndata==0.10.5.post1 umap==0.5.5 numpy==1.26.4 scipy==1.12.0 pandas==2.2.1 scikit-learn==1.4.1.post1 statsmodels==0.14.1 igraph==0.11.4 pynndescent==0.5.11
Load tests datasets
Single-cell SMART-seq data from mouse macrophages (Robles-Rebollo et al. 2022)
mf = pd.read_csv("data/GSE190622_count_matrix_Annotated.csv.gz", index_col=0)
adata = sc.AnnData(X=mf.T)
adata.obs["genotype"] = adata.obs_names.to_series().apply(lambda x: x.split("_")[0])
adata.obs["timepoint"] = adata.obs_names.to_series().apply(lambda x: x.split("_")[1])
adata.obs["sample"] = adata.obs["genotype"] + "_" + adata.obs["timepoint"]
adata.var_names_make_unique()
adataAnnData object with n_obs × n_vars = 1362 × 18429
obs: 'genotype', 'timepoint', 'sample'
Doublet detection
- A doublet is a single-cell transcriptome that has been generated by two cells
- They have often have a higher read and gene count
- They can produce confounding results
How do we fight doublets: - Experiment design - Computationally
Scrublet
Scrublet finds doublets based os simulations (Wolock, Lopez, and Klein 2019)
scrub = scr.Scrublet(adata.X, expected_doublet_rate=0.06)
doublet_scores, predicted_doublets = scrub.scrub_doublets()
scrub.plot_histogram()Preprocessing...
Simulating doublets...
Embedding transcriptomes using PCA...
Calculating doublet scores...
Automatically set threshold at doublet score = 0.41
Detected doublet rate = 0.1%
Estimated detectable doublet fraction = 25.8%
Overall doublet rate:
Expected = 6.0%
Estimated = 0.3%
Elapsed time: 16.9 seconds
(<Figure size 640x240 with 2 Axes>,
array([<Axes: title={'center': 'Observed transcriptomes'}, xlabel='Doublet score', ylabel='Prob. density'>,
<Axes: title={'center': 'Simulated doublets'}, xlabel='Doublet score', ylabel='Prob. density'>],
dtype=object))
Adjust threshold
predicted_doublets = scrub.call_doublets(threshold=0.2)
scrub.plot_histogram()Detected doublet rate = 0.7%
Estimated detectable doublet fraction = 50.8%
Overall doublet rate:
Expected = 6.0%
Estimated = 1.4%
(<Figure size 640x240 with 2 Axes>,
array([<Axes: title={'center': 'Observed transcriptomes'}, xlabel='Doublet score', ylabel='Prob. density'>,
<Axes: title={'center': 'Simulated doublets'}, xlabel='Doublet score', ylabel='Prob. density'>],
dtype=object))
Visualise doublets
scrub.set_embedding('UMAP', scr.get_umap(scrub.manifold_obs_, 10, min_dist=0.3))
scrub.plot_embedding('UMAP', order_points=True);Filter out doublets
adata.obs["scrublet_doublet"] = predicted_doublets
adata = adata[adata.obs["scrublet_doublet"].apply(lambda x: x is False)]Quality metrics
Highest expressed genes
Look for suspects: MALAT1, mitochondrial genes, ribosomal genes, componenets of the cytoskeleton, etc.
sc.pl.highest_expr_genes(adata, n_top=20)normalizing counts per cell
finished (0:00:00)
Counts, genes and mitochondrial percent
sc.pp.calculate_qc_metrics computes quality control metrics for each cell.
- Number of counts per cell
- Number of genes per cell
- Percentage of counts that come from mitochondrial genes
# Mitochondrial genes
adata.var['mt'] = adata.var_names.str.startswith('mt-')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt'],
jitter=0.4, multi_panel=True, rotation = 90)Different sampes might require different thresholds
sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt'],
jitter=0.4, multi_panel=True, groupby='sample', rotation = 90)sc.pl.scatter(adata, x='total_counts', y='pct_counts_mt', color='sample')
sc.pl.scatter(adata, x='total_counts', y='n_genes_by_counts', color='sample')In this case, cells from different samples are reasonably homogeneous. However, that is not always the cells_per_paper.ipynb
QC filtering
Filter cells based on quality metrics
sc.pp.filter_cells(adata, min_genes=500)
sc.pp.filter_genes(adata, min_cells=5)
adata= adata[adata.obs.n_genes_by_counts < 5000, :]
adata= adata[adata.obs.pct_counts_mt < 15, :]filtered out 1 genes that are detected in less than 5 cells
Normalisation
For simplicity, we will use the simplest method: library size normalisation and log transformation, but there are others Hafemeister and Satija (2019).
Steps:
- Transcript length normalisation:
- Adjusts for differences in transcript length between genes
- Divides the counts in each cell by the length of the transcript
- Not necessary if you sequence a fixed region of the transcript ( 3’ or 5’ in 10X, our case)
- Library size normalisation:
- Adjusts for differences in sequencing depth between cells
- Divides the counts in each cell by the total counts in that cell and multiplies by a scale factor (e.g. 10,000)
sc.pp.normalize_total(adata, target_sum=1e4)normalizing counts per cell
finished (0:00:00)
- Log transformation:
- Logarithm of the normalised counts
- Makes the data more normally distributed
sc.pp.log1p(adata)Highly variable genes
- Genes that show the most variation across cells
- Often the most informative genes
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)extracting highly variable genes
finished (0:00:00)
--> added
'highly_variable', boolean vector (adata.var)
'means', float vector (adata.var)
'dispersions', float vector (adata.var)
'dispersions_norm', float vector (adata.var)
Dimensionality reduction
Goal
- Reducing the number of input variables or features in a dataset.
Why
- Noise Reduction: Focus on the most variance-contributing features, thus enhancing the signal-to-noise ratio.
- Data Visualization: We can plot in 2 or 3 dimensions
- Aid clustering and clasification
- Computational efficiency
Prepare data for dimensionality reduction
The .raw attribute freezes the state of the AnnData object for later use. We can recumerate it by calling .raw.to_adata().
adata.raw = adataFilter by highly variable genes
adata= adata[:, adata.var.highly_variable]
adataView of AnnData object with n_obs × n_vars = 1277 × 3193
obs: 'genotype', 'timepoint', 'sample', 'scrublet_doublet', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'n_genes'
var: 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
uns: 'sample_colors', 'log1p', 'hvg'
Regress out effects of total counts per cell and the percentage of mitochondrial genes expressed.
sc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt'])regressing out ['total_counts', 'pct_counts_mt']
finished (0:00:02)
Scale each gene to unit variance. Clip values exceeding standard deviation to 10.
sc.pp.scale(adata, max_value=10)Dimensionality reduction techniques
- Principal Component Analysis (PCA): PCA is often the first step in dimensionality reduction for single-cell data. It identifies the principal components that capture the most variance in the data, reducing its dimensionality while preserving its structure as much as possible.
- t-Distributed Stochastic Neighbor Embedding (t-SNE): Popular dimensionality reduction that foccusses on the preservation on local structures
- Uniform Manifold Approximation and Projection (UMAP): Popular dimensionality reduction that foccusses on the preservation on global structures
Principal component analysis (PCA)
PCA is often the first step in dimensionality reduction for single-cell data. It identifies the principal components that capture the most variance in the data, reducing its dimensionality while preserving its structure as much as possible.
sc.tl.pca(adata, svd_solver='arpack')computing PCA
on highly variable genes
with n_comps=50
finished (0:00:29)
sc.pl.pca(adata, color='sample')Varinace explained by each component:
sc.pl.pca_variance_ratio(adata, log=True)UMAP and t-SNE
First, we need to compute the neighbourhood graph:
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=20)computing neighbors
using 'X_pca' with n_pcs = 20
finished: added to `.uns['neighbors']`
`.obsp['distances']`, distances for each pair of neighbors
`.obsp['connectivities']`, weighted adjacency matrix (0:00:00)
Compute UMAP:
sc.tl.umap(adata)computing UMAP
finished: added
'X_umap', UMAP coordinates (adata.obsm) (0:00:01)
Compute t-SNE:
sc.tl.tsne(adata, n_pcs = 20)computing tSNE
using 'X_pca' with n_pcs = 20
using sklearn.manifold.TSNE
finished: added
'X_tsne', tSNE coordinates (adata.obsm) (0:00:03)
Plot UMAP:
sc.pl.umap(adata, color='sample')Plot t-SNE:
sc.pl.tsne(adata, color='sample')Clustering
sc.tl.leiden(adata)
sc.pl.umap(adata, color=['sample','leiden'], use_raw=False)running Leiden clustering
finished: found 11 clusters and added
'leiden', the cluster labels (adata.obs, categorical) (0:00:00)
Marker genes
Genes differentially expressed in each cluster.
sc.tl.rank_genes_groups(adata, 'leiden', method='wilcoxon')
sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)ranking genes
finished: added to `.uns['rank_genes_groups']`
'names', sorted np.recarray to be indexed by group ids
'scores', sorted np.recarray to be indexed by group ids
'logfoldchanges', sorted np.recarray to be indexed by group ids
'pvals', sorted np.recarray to be indexed by group ids
'pvals_adj', sorted np.recarray to be indexed by group ids (0:00:01)
pd.DataFrame(adata.uns['rank_genes_groups']['names']).head(5)| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Ccl5 | Mapkapk2 | Ctsd | Sepp1 | Ccnd1 | Tnfsf9 | Gm12155 | Hmgb2 | Hist1h2ap | Ddit3 | Tubb5 |
| 1 | Il1rn | Ebi3 | Laptm5 | Tnfsf9 | Lgals3 | Cxcl2 | Gm13341 | Tuba1b | Hist3h2a | Atf4 | Cdca3 |
| 2 | Isg15 | Il1b | Gm42418 | Nfkbiz | Ifi27l2a | Tnf | Gm42836 | Tnf | Trim17 | Hspa5 | Ube2c |
| 3 | Mndal | Serp1 | Lamp1 | Cxcl1 | Rnh1 | Ccl4 | Gm29216 | Tubb5 | H2afx | Slc3a2 | Birc5 |
| 4 | Saa3 | Ehd1 | Sepp1 | Serinc3 | Lpl | Id2 | Mak | Top2a | Hist1h2aa | Sqstm1 | Plk1 |
Plot some markers:
sc.pl.umap(adata, color=['leiden', 'Ccl5', 'Mapkapk2', 'Ctsd', 'Sepp1', 'Ccnd1','Tnfsf9','Gm12155','Hmgb2','Hist1h2ap','Ddit3','Tubb5'], use_raw=True)Session info
!pip install session-info
import session_info
session_info.show()Looking in indexes: https://pypi.org/simple, https://__token__:****@gitlab.com/api/v4/groups/54410195/-/packages/pypi/simple
Requirement already satisfied: session-info in /Users/ireneroblesrebollo/anaconda3/lib/python3.10/site-packages (1.0.0)
Requirement already satisfied: stdlib-list in /Users/ireneroblesrebollo/anaconda3/lib/python3.10/site-packages (from session-info) (0.8.0)
Click to view session information
----- anndata 0.10.5.post1 matplotlib 3.8.3 numpy 1.26.4 pandas 2.2.1 polars 0.20.13 scanpy 1.9.8 scrublet NA seaborn 0.13.2 session_info 1.0.0 -----
Click to view modules imported as dependencies
PIL 10.2.0 annoy NA anyio NA appnope 0.1.4 arrow 1.3.0 asttokens NA attr 23.2.0 attrs 23.2.0 babel 2.14.0 certifi 2024.02.02 cffi 1.16.0 charset_normalizer 3.3.2 comm 0.2.1 cycler 0.12.1 cython_runtime NA dateutil 2.9.0 debugpy 1.8.1 decorator 5.1.1 defusedxml 0.7.1 exceptiongroup 1.2.0 executing 2.0.1 fastjsonschema NA fqdn NA h5py 3.10.0 idna 3.6 igraph 0.11.4 ipykernel 6.29.3 ipywidgets 8.1.2 isoduration NA jedi 0.19.1 jinja2 3.1.3 joblib 1.3.2 json5 NA jsonpointer 2.4 jsonschema 4.21.1 jsonschema_specifications NA jupyter_events 0.9.0 jupyter_server 2.12.5 jupyterlab_server 2.25.3 kiwisolver 1.4.5 lazy_loader NA leidenalg 0.10.2 llvmlite 0.42.0 markupsafe 2.1.5 matplotlib_inline 0.1.6 mpl_toolkits NA natsort 8.4.0 nbformat 5.9.2 numba 0.59.0 overrides NA packaging 23.2 parso 0.8.3 patsy 0.5.6 pexpect 4.9.0 platformdirs 4.2.0 prometheus_client NA prompt_toolkit 3.0.43 psutil 5.9.8 ptyprocess 0.7.0 pure_eval 0.2.2 pyarrow 15.0.0 pycparser 2.21 pydev_ipython NA pydevconsole NA pydevd 2.9.5 pydevd_file_utils NA pydevd_plugins NA pydevd_tracing NA pygments 2.17.2 pynndescent 0.5.11 pyparsing 3.1.1 pythonjsonlogger NA pytz 2024.1 referencing NA requests 2.31.0 rfc3339_validator 0.1.4 rfc3986_validator 0.1.1 rpds NA scipy 1.12.0 send2trash NA six 1.16.0 skimage 0.22.0 sklearn 1.4.1.post1 sniffio 1.3.1 stack_data 0.6.3 statsmodels 0.14.1 texttable 1.7.0 threadpoolctl 3.3.0 tornado 6.4 tqdm 4.66.2 traitlets 5.14.1 typing_extensions NA umap 0.5.5 uri_template NA urllib3 2.2.1 wcwidth 0.2.13 webcolors 1.13 websocket 1.7.0 yaml 6.0.1 zmq 25.1.2 zoneinfo NA
----- IPython 8.22.1 jupyter_client 8.6.0 jupyter_core 5.7.1 jupyterlab 4.1.2 notebook 7.1.1 ----- Python 3.10.13 (main, Sep 11 2023, 08:16:02) [Clang 14.0.6 ] macOS-13.3.1-arm64-arm-64bit ----- Session information updated at 2024-03-09 22:37
Inspirations
- Analysis of single cell RNA-seq data
- Course from University of Cambridge Bioinformatics training unit
- In R
- Single cell study database
- Scanpy tutorials